Overview

Dataset statistics

Number of variables: 11
Number of observations: 90584
Missing cells: 48616
Missing cells (%): 4.9%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 7.6 MiB
Average record size in memory: 88.0 B

Variable types

NUM: 10
CAT: 1

Reproduction

Analysis started: 2020-08-27 16:16:58.948575
Analysis finished: 2020-08-27 16:17:34.019913
Duration: 35.07 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml
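
A rough programmatic equivalent of the command line above, for notebook use. This is a sketch: the CSV name and title are placeholders, and how config.yaml is applied programmatically depends on the pandas-profiling version, so only the default configuration is shown.

    import pandas as pd
    from pandas_profiling import ProfileReport

    # Load the same CSV that the CLI call above profiles.
    df = pd.read_csv("YOUR_FILE.csv")

    # Build the report and write it to disk (title is illustrative).
    profile = ProfileReport(df, title="Profiling Report")
    profile.to_file("report.html")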

Warnings

Body has a high cardinality: 90350 distinct values (High cardinality)
user_id is highly correlated with df_index (High correlation)
df_index is highly correlated with user_id (High correlation)
Views is highly correlated with Reputation (High correlation)
Reputation is highly correlated with Views (High correlation)
ViewCount has 48396 (53.4%) missing values (Missing)
ViewCount is highly skewed (γ1 = 24.95697592) (Skewed)
Body is uniformly distributed (Uniform)
df_index has unique values (Unique)
post_id has unique values (Unique)
Views has 7040 (7.8%) zeros (Zeros)
UpVotes has 22186 (24.5%) zeros (Zeros)
DownVotes has 50205 (55.4%) zeros (Zeros)
Score has 19927 (22.0%) zeros (Zeros)
CommentCount has 38051 (42.0%) zeros (Zeros)
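
Each of these warnings can be spot-checked directly against the data. A minimal sketch, assuming the profiled DataFrame is loaded as df with the column names above (the CSV name is a placeholder):

    import pandas as pd

    df = pd.read_csv("YOUR_FILE.csv")

    print(df["ViewCount"].isna().mean())      # expect ~0.534 (53.4% missing)
    print((df["DownVotes"] == 0).mean())      # expect ~0.554 (55.4% zeros)
    print(df["user_id"].corr(df["df_index"])) # the "High correlation" pair
    print(df["ViewCount"].skew())             # expect ~24.96 (highly skewed)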

Variables

df_index
Real number (ℝ≥0)

HIGH CORRELATION
UNIQUE

Distinct count: 90584
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 45391.35905899497
Minimum: 0
Maximum: 90882
Zeros: 1
Zeros (%): < 0.1%
Memory size: 707.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 4529.15
Q1: 22649.75
Median: 45372.5
Q3: 68106.25
95th percentile: 86331.85
Maximum: 90882
Range: 90882
Interquartile range (IQR): 45456.5

Descriptive statistics

Standard deviation: 26245.53906
Coefficient of variation (CV): 0.5782056234
Kurtosis: -1.200790336
Mean: 45391.35906
Median Absolute Deviation (MAD): 22728.5
Skewness: 0.002910903364
Sum: 4111730869
Variance: 688828320.7
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
2047 | 1 | < 0.1%
13011 | 1 | < 0.1%
43648 | 1 | < 0.1%
41601 | 1 | < 0.1%
47746 | 1 | < 0.1%
45699 | 1 | < 0.1%
35460 | 1 | < 0.1%
33413 | 1 | < 0.1%
39558 | 1 | < 0.1%
37511 | 1 | < 0.1%
Other values (90574) | 90574 | > 99.9%

Minimum 5 values

Value | Count | Frequency (%)
0 | 1 | < 0.1%
1 | 1 | < 0.1%
2 | 1 | < 0.1%
3 | 1 | < 0.1%
4 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
90882 | 1 | < 0.1%
90881 | 1 | < 0.1%
90880 | 1 | < 0.1%
90879 | 1 | < 0.1%
90878 | 1 | < 0.1%

user_id
Real number (ℝ)

HIGH CORRELATION

Distinct count: 21983
Unique (%): 24.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 16546.764726662546
Minimum: -1
Maximum: 55746
Zeros: 0
Zeros (%): 0.0%
Memory size: 707.7 KiB

Quantile statistics

Minimum: -1
5th percentile: 366.15
Q1: 3437
Median: 11032
Q3: 27700
95th percentile: 46368
Maximum: 55746
Range: 55747
Interquartile range (IQR): 24263

Descriptive statistics

Standard deviation: 15273.36711
Coefficient of variation (CV): 0.9230425017
Kurtosis: -0.4539287823
Mean: 16546.76473
Median Absolute Deviation (MAD): 9908
Skewness: 0.8247545475
Sum: 1498872136
Variance: 233275742.8
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
805 | 1720 | 1.9%
686 | 1598 | 1.8%
919 | 1204 | 1.3%
11032 | 966 | 1.1%
7290 | 827 | 0.9%
4505 | 661 | 0.7%
183 | 493 | 0.5%
930 | 458 | 0.5%
4253 | 450 | 0.5%
3382 | 425 | 0.5%
Other values (21973) | 81782 | 90.3%

Minimum 5 values

Value | Count | Frequency (%)
-1 | 211 | 0.2%
5 | 117 | 0.1%
6 | 12 | < 0.1%
7 | 2 | < 0.1%
8 | 121 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
55746 | 1 | < 0.1%
55744 | 1 | < 0.1%
55742 | 1 | < 0.1%
55738 | 1 | < 0.1%
55734 | 1 | < 0.1%

Reputation
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 965
Unique (%): 1.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6282.395411993288
Minimum: 1
Maximum: 87393
Zeros: 0
Zeros (%): 0.0%
Memory size: 707.7 KiB

Quantile statistics

Minimum: 1
5th percentile: 1
Q1: 60
Median: 396
Q3: 4460
95th percentile: 37083
Maximum: 87393
Range: 87392
Interquartile range (IQR): 4400

Descriptive statistics

Standard deviation: 15102.26867
Coefficient of variation (CV): 2.403902919
Kurtosis: 13.43967443
Mean: 6282.395412
Median Absolute Deviation (MAD): 390
Skewness: 3.574815757
Sum: 569084506
Variance: 228078519
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
1 | 4546 | 5.0%
6 | 3196 | 3.5%
11 | 2644 | 2.9%
65272 | 1720 | 1.9%
44152 | 1598 | 1.8%
16 | 1369 | 1.5%
87393 | 1204 | 1.3%
21 | 1172 | 1.3%
22275 | 966 | 1.1%
37083 | 827 | 0.9%
Other values (955) | 71342 | 78.8%

Minimum 5 values

Value | Count | Frequency (%)
1 | 4546 | 5.0%
2 | 12 | < 0.1%
3 | 448 | 0.5%
4 | 123 | 0.1%
5 | 26 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
87393 | 1204 | 1.3%
65272 | 1720 | 1.9%
44152 | 1598 | 1.8%
37083 | 827 | 0.9%
31170 | 458 | 0.5%

Views
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct count: 361
Unique (%): 0.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1034.2451757484766
Minimum: 0
Maximum: 20932
Zeros: 7040
Zeros (%): 7.8%
Memory size: 707.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 5
Median: 45
Q3: 514.25
95th percentile: 5680
Maximum: 20932
Range: 20932
Interquartile range (IQR): 509.25

Descriptive statistics

Standard deviation: 2880.074012
Coefficient of variation (CV): 2.784711091
Kurtosis: 28.31236715
Mean: 1034.245176
Median Absolute Deviation (MAD): 44
Skewness: 4.873839918
Sum: 93686065
Variance: 8294826.315
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 7040 | 7.8%
1 | 5209 | 5.8%
2 | 3819 | 4.2%
3 | 2889 | 3.2%
4 | 2330 | 2.6%
5 | 1817 | 2.0%
6 | 1726 | 1.9%
5680 | 1720 | 1.9%
7357 | 1598 | 1.8%
7 | 1533 | 1.7%
Other values (351) | 60903 | 67.2%

Minimum 5 values

Value | Count | Frequency (%)
0 | 7040 | 7.8%
1 | 5209 | 5.8%
2 | 3819 | 4.2%
3 | 2889 | 3.2%
4 | 2330 | 2.6%

Maximum 5 values

Value | Count | Frequency (%)
20932 | 1204 | 1.3%
7395 | 966 | 1.1%
7357 | 1598 | 1.8%
6948 | 450 | 0.5%
5927 | 266 | 0.3%

UpVotes
Real number (ℝ≥0)

ZEROS

Distinct count: 330
Unique (%): 0.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 734.3157180075951
Minimum: 0
Maximum: 11442
Zeros: 22186
Zeros (%): 24.5%
Memory size: 707.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 1
Median: 22
Q3: 283
95th percentile: 5007
Maximum: 11442
Range: 11442
Interquartile range (IQR): 282

Descriptive statistics

Standard deviation: 2050.869327
Coefficient of variation (CV): 2.792898581
Kurtosis: 14.31300098
Mean: 734.315718
Median Absolute Deviation (MAD): 22
Skewness: 3.790593781
Sum: 66517255
Variance: 4206064.997
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 22186 | 24.5%
1 | 3681 | 4.1%
2 | 2887 | 3.2%
3 | 1952 | 2.2%
7035 | 1720 | 1.9%
4 | 1705 | 1.9%
2156 | 1598 | 1.8%
6 | 1583 | 1.7%
5 | 1315 | 1.5%
7 | 1220 | 1.3%
Other values (320) | 50737 | 56.0%

Minimum 5 values

Value | Count | Frequency (%)
0 | 22186 | 24.5%
1 | 3681 | 4.1%
2 | 2887 | 3.2%
3 | 1952 | 2.2%
4 | 1705 | 1.9%

Maximum 5 values

Value | Count | Frequency (%)
11442 | 187 | 0.2%
11273 | 1204 | 1.3%
10523 | 458 | 0.5%
8641 | 827 | 0.9%
7035 | 1720 | 1.9%

DownVotes
Real number (ℝ≥0)

ZEROS

Distinct count: 76
Unique (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 33.27324913892078
Minimum: 0
Maximum: 1920
Zeros: 50205
Zeros (%): 55.4%
Memory size: 707.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 0
Median: 0
Q3: 8
95th percentile: 143
Maximum: 1920
Range: 1920
Interquartile range (IQR): 8

Descriptive statistics

Standard deviation: 134.9364354
Coefficient of variation (CV): 4.05540303
Kurtosis: 98.77457755
Mean: 33.27324914
Median Absolute Deviation (MAD): 0
Skewness: 8.757213426
Sum: 3014024
Variance: 18207.84159
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 50205 | 55.4%
1 | 4628 | 5.1%
2 | 3250 | 3.6%
4 | 2226 | 2.5%
143 | 1989 | 2.2%
3 | 1905 | 2.1%
6 | 1855 | 2.0%
5 | 1846 | 2.0%
82 | 1598 | 1.8%
8 | 1562 | 1.7%
Other values (66) | 19520 | 21.5%

Minimum 5 values

Value | Count | Frequency (%)
0 | 50205 | 55.4%
1 | 4628 | 5.1%
2 | 3250 | 3.6%
3 | 1905 | 2.1%
4 | 2226 | 2.5%

Maximum 5 values

Value | Count | Frequency (%)
1920 | 211 | 0.2%
779 | 1204 | 1.3%
412 | 266 | 0.3%
351 | 291 | 0.3%
214 | 458 | 0.5%

post_id
Real number (ℝ≥0)

UNIQUE

Distinct count: 90584
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 56539.08052194648
Minimum: 1
Maximum: 115378
Zeros: 0
Zeros (%): 0.0%
Memory size: 707.7 KiB

Quantile statistics

Minimum: 1
5th percentile: 5315.15
Q1: 26051.75
Median: 57225.5
Q3: 86145.25
95th percentile: 110267.85
Maximum: 115378
Range: 115377
Interquartile range (IQR): 60093.5

Descriptive statistics

Standard deviation: 33840.30753
Coefficient of variation (CV): 0.5985294988
Kurtosis: -1.231769907
Mean: 56539.08052
Median Absolute Deviation (MAD): 30031
Skewness: 0.03591388347
Sum: 5121536070
Variance: 1145166414
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
4094 | 1 | < 0.1%
29403 | 1 | < 0.1%
51852 | 1 | < 0.1%
49805 | 1 | < 0.1%
55950 | 1 | < 0.1%
8849 | 1 | < 0.1%
14994 | 1 | < 0.1%
12947 | 1 | < 0.1%
2708 | 1 | < 0.1%
661 | 1 | < 0.1%
Other values (90574) | 90574 | > 99.9%

Minimum 5 values

Value | Count | Frequency (%)
1 | 1 | < 0.1%
2 | 1 | < 0.1%
3 | 1 | < 0.1%
4 | 1 | < 0.1%
5 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
115378 | 1 | < 0.1%
115377 | 1 | < 0.1%
115376 | 1 | < 0.1%
115375 | 1 | < 0.1%
115374 | 1 | < 0.1%

Score
Real number (ℝ)

ZEROS

Distinct count: 128
Unique (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.7807670228737966
Minimum: -19
Maximum: 192
Zeros: 19927
Zeros (%): 22.0%
Memory size: 707.7 KiB

Quantile statistics

Minimum: -19
5th percentile: 0
Q1: 1
Median: 2
Q3: 3
95th percentile: 9
Maximum: 192
Range: 211
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 4.948921899
Coefficient of variation (CV): 1.779696702
Kurtosis: 192.5905972
Mean: 2.780767023
Median Absolute Deviation (MAD): 1
Skewness: 9.827873481
Sum: 251893
Variance: 24.49182796
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
1 | 22901 | 25.3%
0 | 19927 | 22.0%
2 | 15248 | 16.8%
3 | 9909 | 10.9%
4 | 6210 | 6.9%
5 | 4142 | 4.6%
6 | 2849 | 3.1%
7 | 1941 | 2.1%
8 | 1305 | 1.4%
9 | 950 | 1.0%
Other values (118) | 5202 | 5.7%

Minimum 5 values

Value | Count | Frequency (%)
-19 | 2 | < 0.1%
-13 | 1 | < 0.1%
-10 | 1 | < 0.1%
-9 | 2 | < 0.1%
-8 | 2 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
192 | 1 | < 0.1%
184 | 1 | < 0.1%
164 | 1 | < 0.1%
156 | 1 | < 0.1%
152 | 1 | < 0.1%

ViewCount
Real number (ℝ≥0)

MISSING
SKEWED

Distinct count: 3654
Unique (%): 8.7%
Missing: 48396
Missing (%): 53.4%
Infinite: 0
Infinite (%): 0.0%
Mean: 556.6561581492367
Minimum: 1.0
Maximum: 175495.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 707.7 KiB

Quantile statistics

Minimum: 1
5th percentile: 19
Q1: 53
Median: 126
Q3: 367
95th percentile: 2107.95
Maximum: 175495
Range: 175494
Interquartile range (IQR): 314

Descriptive statistics

Standard deviation: 2356.930779
Coefficient of variation (CV): 4.234087317
Kurtosis: 1135.338873
Mean: 556.6561581
Median Absolute Deviation (MAD): 91
Skewness: 24.95697592
Sum: 23484210
Variance: 5555122.698
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
38 | 295 | 0.3%
31 | 293 | 0.3%
37 | 277 | 0.3%
27 | 277 | 0.3%
24 | 274 | 0.3%
36 | 272 | 0.3%
30 | 270 | 0.3%
33 | 262 | 0.3%
25 | 262 | 0.3%
32 | 261 | 0.3%
Other values (3644) | 39445 | 43.5%
(Missing) | 48396 | 53.4%

Minimum 5 values

Value | Count | Frequency (%)
1 | 1 | < 0.1%
2 | 5 | < 0.1%
3 | 6 | < 0.1%
4 | 20 | < 0.1%
5 | 33 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
175495 | 1 | < 0.1%
98109 | 1 | < 0.1%
92612 | 1 | < 0.1%
91848 | 1 | < 0.1%
88129 | 1 | < 0.1%
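
Given the Skewed warning above (γ1 ≈ 24.96), a common next step is to re-examine this variable on a log scale. A sketch, not part of the report itself, assuming df is the profiled DataFrame:

    import numpy as np

    # log1p computes log(1 + x); safe here since the minimum observed value is 1.0.
    log_views = np.log1p(df["ViewCount"].dropna())
    print(log_views.skew())  # expect something far smaller than the raw ~24.96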
 

CommentCount
Real number (ℝ≥0)

ZEROS

Distinct count: 39
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.894650269363243
Minimum: 0
Maximum: 45
Zeros: 38051
Zeros (%): 42.0%
Memory size: 707.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 0
Median: 1
Q3: 3
95th percentile: 7
Maximum: 45
Range: 45
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.638704141
Coefficient of variation (CV): 1.392713042
Kurtosis: 12.44510758
Mean: 1.894650269
Median Absolute Deviation (MAD): 1
Skewness: 2.574211733
Sum: 171625
Variance: 6.962759541
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 38051 | 42.0%
1 | 14798 | 16.3%
2 | 12527 | 13.8%
3 | 7835 | 8.6%
4 | 5560 | 6.1%
5 | 3651 | 4.0%
6 | 2601 | 2.9%
7 | 1701 | 1.9%
8 | 1198 | 1.3%
9 | 835 | 0.9%
Other values (29) | 1827 | 2.0%

Minimum 5 values

Value | Count | Frequency (%)
0 | 38051 | 42.0%
1 | 14798 | 16.3%
2 | 12527 | 13.8%
3 | 7835 | 8.6%
4 | 5560 | 6.1%

Maximum 5 values

Value | Count | Frequency (%)
45 | 1 | < 0.1%
41 | 2 | < 0.1%
37 | 2 | < 0.1%
35 | 2 | < 0.1%
34 | 1 | < 0.1%

Body
Categorical

HIGH CARDINALITY
UNIFORM

Distinct count: 90350
Unique (%): > 99.9%
Missing: 220
Missing (%): 0.2%
Memory size: 707.7 KiB

Common values

Value | Count | Frequency (%)
<p>So I am developing this application for rating books (think like IMDB for books) using relational database. </p> <p><strong>Problem statement :</strong></p> <p>Let's say book "<strong>A</strong>" deserves 8.5 in absolute sense. In case if A is the best book I have ever seen, I'll most probably rate it > 9.5 whereas for someone else, it might be just an average book, so he/they will rate it less (say around 8). Let's assume 4 such guys rate it 8.</p> <p>If there are 10 guys who are like me (who haven't ever read great literature) and they all rate it 9.5-10. This will effectively make it's cumulative rating greater than 9 (9.5*10 + 8*4) / 14 = 9.1</p> <p>whereas we needed the result to be 8.5 ... How can I take care of(normalize) this bias due to incorrect perception of individuals.</p> <p><strong>MyProposedSolution :</strong></p> <p>Here's one of the ways how I think it could be solved. We can have a variable <strong>Lit_coefficient</strong> which tells us how much knowledge a user has about literature. If I rate "<strong>A</strong>"(the book) 9.5 and person "<strong>X</strong>" rates it 8, then he must have read books much better than "<strong>A</strong>" and thus his Lit_coefficient should be higher. And then we can normalize the ratings according to the Lit_coefficient of user. Could there be a better algorithm/solution for the same?</p> | 2 | < 0.1%
<p><a href="http://en.wikipedia.org/wiki/Proportional_hazards_models" rel="nofollow">Cox proportional hazards regression</a> is a very popular, semi-parametric method for survival analysis. </p> <p>It is semi-parametric in that the baseline hazard is left unspecified, but parameters for the effects of covariates are estimated. Eliminating the possibility of misspecifying the baseline makes the beta estimates more robust.</p> <p><em>Proportional hazards</em> means that no matter what the baseline hazard may be at any point in time, the ceteris paribus effect of a one-unit increase in a covariate is a constant multiple of the baseline hazard. </p> | 2 | < 0.1%
<p>In a MCMC implementation of hierarchical models, with normal random effects and a Wishart prior for their covariance matrix, Gibbs sampling is typically used.</p> <p>However, if we change the distribution of the random effects (e.g., to Student's-t or another one), the conjugacy is lost. In this case, what would be a suitable (i.e., easily tunable) proposal distribution for the covariance matrix of the random effects in a Metropolis-Hastings algorithm, and what should be the target acceptance rate, again 0.234?</p> <p>Thanks in advance for any pointers.</p> | 2 | < 0.1%
<p>So I'm looking to compare different combinations of features and classifiers. But I'm getting a lot of combinations that achieve 100% cross validation accuracy. I'm trying to figure out how I would compare the usefulness of each combination.</p> <p>For example I can both train an SVM using Features 1, 10, 15 to get 100% accuracy. But at the same time I can train a logistic regression classifier only using Feature 7 to get 100% accuracy. Also this is a binary classification problem.</p> | 2 | < 0.1%
<p>Actually, <strong>frequent itemset mining</strong> may be a better choice than clustering on such data.</p> <p>The usual vector-oriented set of algorithms does not make a lot of sense. K-means for example will produce means that are no longer binary.</p> | 2 | < 0.1%
<p>I understand that fuzzy clustering using FCM produces a membership matrix for the set of data points we feed to it. What characteristics will an anomalous cluster produced during this method have? (Considering I only have unlabelled data)</p> | 2 | < 0.1%
Hidden Markov Models are used for modelling systems that are assumed to be Markov processes with hidden (i.e. unobserved) states. | 2 | < 0.1%
<p><a href="http://www.math.umass.edu/~lavine/Book/book.html">Introduction to Statistical Thought</a></p> | 2 | < 0.1%
<p>I'm trying to improve a factory quality control.</p> <p>I have some variables from the melting process (something like ten control variables) that changes trough time (a matrix of the values of those control per minute), and in the end I have a quality score for the final product (one single variable). I have one of this for each production batch.</p> <p>I want to know if you guys can help me saying how can I look for correlations with the quality score inside that matrix.</p> <p>I know that I can look to each control variable alone, but those variables interfere with each other. So it is necessary to look at the sistem as a one.</p> <p>Thank you.</p> | 2 | < 0.1%
<p>Given a i.i.d sample $X_{1},..,X_{n}$ of bernoulli random variables test 2 hypotheses $H_{0}:p=2/3$ and $H_{1}:p=1/3$. Bayesian prior is $\\pi(2/3)=1/3$ and $\\pi(1/3)=2/3$. Find the bayesian criterion for acceptng $H_{0}$, find the bayesian mean square error for the test and for $n=8$ compute this mean square error using normal approximation</p> <p>I have found the bayesian criterion for acceptance as $\\sum_{i=1}^{n}x_{i}{\\geq}\\frac{n+1-log_{2}(\\alpha^{-1}-1)}{2}$. where $\\alpha$ is a value is chosen prior to the test. How do you do the other two parts? </p> <p>Thanks</p> | 2 | < 0.1%
Other values (90340) | 90344 | 99.7%
(Missing) | 220 | 0.2%

Length

Max length: 38847
Median length: 815
Mean length: 1128.987846
Min length: 3
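
These length statistics can be reproduced with pandas string methods; a sketch assuming df is the profiled DataFrame:

    lengths = df["Body"].dropna().str.len()
    print(lengths.max(), lengths.median(), lengths.mean(), lengths.min())
    # expect 38847, 815, ~1128.99, 3 per the table above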

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, so shifting or rescaling either variable leaves r unchanged.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
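
In symbols (a restatement of the sentence above, in LaTeX notation):

    r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}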

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better than Pearson's r at capturing nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
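
In symbols, writing rg_X and rg_Y for the rank variables:

    \rho = \frac{\operatorname{cov}(\operatorname{rg}_X, \operatorname{rg}_Y)}{\sigma_{\operatorname{rg}_X} \, \sigma_{\operatorname{rg}_Y}}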

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one counts the concordant and discordant pairs of observations; τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
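
In symbols, with n_c concordant pairs and n_d discordant pairs among the \binom{n}{2} possible pairs of n observations:

    \tau = \frac{n_c - n_d}{\binom{n}{2}}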

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available in the phik project's documentation (the hyperlink from the original report is not preserved in this export).
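
All four matrices can be recomputed outside the report. A sketch, assuming df is the profiled DataFrame: Pearson, Spearman, and Kendall are built into pandas; φk comes from the separate phik package, which (as of recent versions) registers a phik_matrix accessor on DataFrames when imported.

    import phik  # noqa: F401  (enables df.phik_matrix)

    numeric = df.select_dtypes("number")
    pearson = numeric.corr(method="pearson")
    spearman = numeric.corr(method="spearman")
    kendall = numeric.corr(method="kendall")
    # φk also handles categorical columns; in practice you may want to drop
    # high-cardinality text columns such as Body first, for speed.
    phik_matrix = df.drop(columns=["Body"]).phik_matrix()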

Missing values
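
In the HTML report this section holds image-based plots (count, matrix, and heatmap views), which a text export cannot carry. The same information can be tabulated; a sketch assuming df is the profiled DataFrame:

    missing = df.isna().sum().to_frame("n_missing")
    missing["pct"] = 100 * missing["n_missing"] / len(df)
    print(missing.sort_values("pct", ascending=False))
    # expect ViewCount at ~53.4% and Body at ~0.2%, matching the Overview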

Sample

First rows

(index) | df_index | user_id | Reputation | Views | UpVotes | DownVotes | post_id | Score | ViewCount | CommentCount | Body
0 | 0 | -1 | 1 | 0 | 5007 | 1920 | 2175 | 0 | NaN | 0 | <p><strong>CrossValidated</strong> is for statisticians, data miners, and anyone else doing data analysis or interested in it as a discipline. If you have a question about</p>\n\n<ul>\n<li><strong>statistical analysis</strong>, applied or theoretical</li>\n<li><strong>designing experiments</strong></li>\n<li><strong>collecting data</strong></li>\n<li><strong>data mining</strong></li>\n<li><strong>machine learning</strong></li>\n<li><strong>visualizing data</strong></li>\n<li><strong>probability theory</strong></li>\n<li><strong>mathematical statistics</strong></li>\n<li>statistical and data-driven <strong>computing</strong></li>\n</ul>\n\n<p>then you're in the right place. Anybody can ask a question, regardless of skills and experience, but some questions are still better than others. If you came here with a question to ask and are new to the site, please consult our thread on <a href="http://meta.stats.stackexchange.com/questions/1479/how-to-ask-a-good-question-on-crossvalidated">how to ask a good question</a>.</p>\n\n<p>Our community aims to create a lasting record of great solutions to questions. For more about this and guidance about how to provide your own great answers, please read <a href="http://meta.stats.stackexchange.com/questions/1390/how-should-questions-be-answered-on-cross-validated">How should questions be answered on Cross Validated?</a>. Providing references to peer-reviewed literature or links to on-line resources is warmly welcomed. You can also incorporate the work of others under <a href="http://en.wikipedia.org/wiki/Fair_use" rel="nofollow">fair use doctrine</a>, which particularly means that you <em>must</em> attribute any text, images, or other material that is not originally yours.</p>\n\n<p><strong>Homework</strong> questions are welcome. <em>Please mark them with the <a href="http://stats.stackexchange.com/questions/tagged/homework">homework</a> tag</em>. They get <a href="http://meta.stackexchange.com/questions/10811/how-to-ask-and-answer-homework-questions/10812#10812">somewhat special treatment</a>, because ultimately you benefit most by finding the solution <em>yourself.</em> The community will try to provide <a href="http://meta.stats.stackexchange.com/q/12/919">guidance, hints, and useful links</a>.</p>\n\n<p><em>There are certain subjects that will probably get better responses on our sister sites</em>. If your question is about</p>\n\n<ul>\n<li><strong>Programming</strong>, ask on <a href="http://stackoverflow.com">Stack Overflow</a>. If the language is statistically oriented (such as <strong>R</strong>, <strong>SAS</strong>, <strong>Stata</strong>, <strong>SPSS</strong>, etc.), then decide based on the nature of your question: if it needs <em>statistical expertise</em> to understand or answer, ask it here; if it's about an <em>algorithm</em>, routine <em>data processing</em>, or details of the <em>language</em>, then please refer to the <a href="http://meta.stats.stackexchange.com/questions/793/internet-support-for-statistics-software">collection of links to resources</a> we maintain.</li>\n<li><strong>Mathematics</strong>, ask on <a href="http://math.stackexchange.com">math.stackexchange.com</a>.</li>\n<li><strong>Bugs in software</strong>, ask the people who produced the software.</li>\n</ul>\n\n<p>Questions about <strong>obtaining particular datasets</strong> are off-topic (they are too specialized). The <a href="http://gis.stackexchange.com">GIS site</a> welcomes inquiries about obtaining geographically related datasets.</p>\n\n<p>Please note, however, that <em>cross-posting is not encouraged</em> on SE sites. Choose one best location to post your question. Later, if it proves better suited on another site, it can be <em>migrated</em>.</p>\n
1 | 1 | -1 | 1 | 0 | 5007 | 1920 | 8576 | 0 | NaN | 0 | NaN
2 | 2 | -1 | 1 | 0 | 5007 | 1920 | 8578 | 0 | NaN | 0 | NaN
3 | 3 | -1 | 1 | 0 | 5007 | 1920 | 8981 | 0 | NaN | 0 | <p>"Statistics" can refer variously to the (wide) field of statistical theory and statistical analysis; to constructing functions of data as used in formal procedures; to collections of data; and to summaries of data.</p>\n\n<p>Because this site is about statistics and statistical analysis, it is rare that tagging a question with "statistics" will be informative. Use of this tag will signal that your question is extremely general and broad.</p>\n
4 | 4 | -1 | 1 | 0 | 5007 | 1920 | 8982 | 0 | NaN | 0 | This generic tag is only rarely suitable; use it with caution. Consider selecting more specific, descriptive tags.
5 | 5 | -1 | 1 | 0 | 5007 | 1920 | 9857 | 0 | NaN | 0 | NaN
6 | 6 | -1 | 1 | 0 | 5007 | 1920 | 9858 | 0 | NaN | 0 | Linear regression is a type of regression when regression function is linear. It is most widely used regression type.
7 | 7 | -1 | 1 | 0 | 5007 | 1920 | 9860 | 0 | NaN | 0 | NaN
8 | 8 | -1 | 1 | 0 | 5007 | 1920 | 10130 | 0 | NaN | 0 | NaN
9 | 9 | -1 | 1 | 0 | 5007 | 1920 | 10131 | 0 | NaN | 0 | NaN

Last rows

(index) | df_index | user_id | Reputation | Views | UpVotes | DownVotes | post_id | Score | ViewCount | CommentCount | Body
90574 | 90873 | 55724 | 16 | 1 | 0 | 0 | 115335 | 3 | 19.0 | 1 | <p>I have a set of objects, each of which can be assigned to another object within the set in a many-to-one, directional assignment, like a vote. Objects cannot be assigned to themselves reflexively, but they can be in an unassigned state ("not voting"). So for example, the set <code>{A, ..., Z}</code> plus relation could be in the following state:</p>\n\n<pre><code>A -&gt; E\nE -&gt; C\nB -&gt; C\nC -&gt; B\nD -&gt; Y\n{F, ..., Z} are not voting.\n</code></pre>\n\n<p>This state will change over time, in discrete steps, one vote at a time. For example, at time <code>t = 0</code> the set could be as above. Then at <code>t = 1</code>, <code>Z</code> votes for <code>C</code> (<code>Z -&gt; C</code>), resulting in the state:</p>\n\n<pre><code>A -&gt; E\nE -&gt; C\nB -&gt; C\nC -&gt; B\nD -&gt; Y\nZ -&gt; C\n{F, ..., Y} are unassigned.\n</code></pre>\n\n<p>Can anyone come up with a neat way to illustrate graphically this set and the state changes over time? Ideally, it should be evident at a glance</p>\n\n<ul>\n<li>How many votes there are for one object at a given time.</li>\n<li>Which objects are voting for another given object.</li>\n</ul>\n\n<p>Thanks!</p>\n
90575 | 90874 | 55729 | 16 | 1 | 0 | 0 | 115338 | 0 | NaN | 0 | <p>All items are not solved fully correctly in the question. I would recommend the following.</p>\n\n<p>(0) The observations "y" do not need to be corrected as they are between 0 and 1 already. Applying the correction shouldn't create problems but it's not necessary either.</p>\n\n<p>(1) cannot be answered by the likelihood ratio (LR) test. Generally in mixture models, the selection of the number of components cannot be based on the LR test because its regularity assumptions are not fulfilled. Instead, information criteria are often used and "flexmix" upon which betamix() is based offers AIC, BIC, and ICL. So you could choose the best BIC solution among 1, 2, 3 clusters via</p>\n\n<pre><code>library("flexmix")\nset.seed(0)\nm &lt;- betamix(y ~ 1 | 1, data = d, k = 1:3)\n</code></pre>\n\n<p>(2) The parameters in betamix() are not mu and phi directly but additionally link functions are employed for both parameters. The defaults are logit and log, respectively. This ensure that the parameters are in their valid ranges (0, 1) and (0, inf), respectively. One could refit the models in both components to get easier access to the links and inverse links etc. However, here it is probably easiest to apply the inverse links by hand:</p>\n\n<pre><code>mu &lt;- plogis(coef(m)[,1])\nphi &lt;- exp(coef(m)[,2])\n</code></pre>\n\n<p>This shows that the means are very different (0.25 and 0.77) while the precisions are rather similar (49.4 and 47.8). Then we can transform back to alpha and beta which gives 12.4, 37.0 and 36.7, 11.1 which is reasonably close to the original parameters in the simulation:</p>\n\n<pre><code>a &lt;- mu * phi\nb &lt;- (1 - mu) * phi\n</code></pre>\n\n<p>(3) The clusters can be extracted using the clusters() function. This simply selects the component with the highest posterior() probability. In this case, the posterior() is really clear-cut, i.e., either close to zero or close to 1.</p>\n\n<pre><code>cl &lt;- clusters(m)\n</code></pre>\n\n<p>(4) When visualizing the data with histograms, one can either visualize both components separately, i.e., each with its own density function. Or one can draw one joint histogram with the corresponding joint density. The difference is that the latter needs to factor in the different cluster sizes: the prior weights are about 1/3 and 2/3 here. The separate histograms can be drawn like this:</p>\n\n<pre><code>## separate histograms for both clusters\nhist(subset(d, cl == 1)$y, breaks = 0:25/25, freq = FALSE,\n col = hcl(0, 50, 80), main = "", xlab = "y", ylim = c(0, 9))\n\nhist(subset(d, cl == 2)$y, breaks = 0:25/25, freq = FALSE,\n col = hcl(240, 50, 80), main = "", xlab = "y", ylim = c(0, 9), add = TRUE)\n\n## lines for fitted densities\nys &lt;- seq(0, 1, by = 0.01)\nlines(ys, dbeta(ys, shape1 = a[1], shape2 = b[1]),\n col = hcl(0, 80, 50), lwd = 2)\nlines(ys, dbeta(ys, shape1 = a[2], shape2 = b[2]),\n col = hcl(240, 80, 50), lwd = 2)\n\n## lines for corresponding means\nabline(v = mu[1], col = hcl(0, 80, 50), lty = 2, lwd = 2)\nabline(v = mu[2], col = hcl(240, 80, 50), lty = 2, lwd = 2)\n</code></pre>\n\n<p>And the joint histogram:</p>\n\n<pre><code>p &lt;- prior(m$flexmix)\n hist(d$y, breaks = 0:25/25, freq = FALSE,\n main = "", xlab = "y", ylim = c(0, 4.5))\nlines(ys, p[1] * dbeta(ys, shape1 = a[1], shape2 = b[1]) +\n p[2] * dbeta(ys, shape1 = a[2], shape2 = b[2]), lwd = 2)\n</code></pre>\n\n<p>The resulting figure is included below.</p>\n\n<p><img src="http://i.stack.imgur.com/7dcW4.png" alt="enter image description here"></p>\n
90576 | 90875 | 55730 | 1 | 0 | 0 | 0 | 115340 | 0 | 18.0 | 1 | <p>I am working on decision trees for the first time at job. I have done lot of research on CHAID and CART algorithms but find different answers to a very simple question given below :</p>\n\n<p><strong>What kind of target variables CART can have?</strong></p>\n\n<p>I understand that CART can help both in prediction and classification. I confirm that the target variable for regression tree is continuous. I am getting different answers in various research papers with regards to the target variable of classification tree. Can someone please help me with this?</p>\n\n<p>Further, I need to do some analysis on the following :</p>\n\n<p>1)Identifying frauds : I intend to use classification trees of CART/CHAID/logistic regression\n2) Forecasting losses : I intend to use linear regression or regression trees of CART\n3) Identifying cheque bounce customers : I intend to use classification trees OF CART/CHAID or logistic regression</p>\n\n<p>Kindly suggest if this is the right way to go....</p>\n\n<p>thanks\nquants_mum</p>\n
90577 | 90876 | 55731 | 1 | 0 | 0 | 0 | 115350 | 0 | 3.0 | 0 | <p>How do we specify negative costs in rpart? The documentation says the diagonals of the loss matrix should be zero. Is there an alternative to specify the benefits of correct classification (that is, the negative cost)?</p>\n
90578 | 90877 | 55733 | 6 | 1 | 0 | 0 | 115356 | 1 | 15.0 | 0 | <p>I'm a beginner in statistics and I have to run multilevel logistic regressions. I am confused with the results as they differ from logistic regression with just one level. </p>\n\n<p>I don't know how to interpret the variance and correlation of the random variables. And I wonder how to compute the ICC.</p>\n\n<p>For example : I have a dependent variable about the protection friendship ties give to individuals (1 is for individuals who can rely a lot on their friends, 0 is for the others). There are 50 geographic clusters of respondant and one random variable which is a factor about the social situation of the neighborhood. Upper/middle class is the reference, the other modalities are working class and underprivileged neighborhoods. </p>\n\n<p>I get these results :</p>\n\n<pre><code>&gt; summary(RLM3)\nGeneralized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']\n Family: binomial ( logit )\nFormula: Arp ~ Densite2 + Sexe + Age + Etudes + pcs1 + Enfants + Origine3 + Sante + Religion + LPO + Sexe * Enfants + Rev + (1 + Strate | \n Quartier)\n Data: LPE\nWeights: PONDERATION\nControl: glmerControl(optimizer = "bobyqa")\n\n AIC BIC logLik deviance df.resid \n 3389.9 3538.3 -1669.9 3339.9 2778 \n\nScaled residuals: \n Min 1Q Median 3Q Max \n-3.2216 -0.7573 -0.3601 0.8794 2.7833 \n\nRandom effects:\n Groups Name Variance Std.Dev. Corr \n Neighb. (Intercept) 0.2021 0.4495 \n Working Cl. 0.2021 0.4495 -1.00 \n Underpriv. 0.2021 0.4495 -1.00 1.00\nNumber of obs: 2803, groups: Neigh., 50\n\nFixed effects:\n</code></pre>\n\n<p>The differences with the "call" part is due to the fact I translated some words.</p>\n\n<p>I think I understand the relation between the random intercept and the random slope for linear regressions but it is more difficult for logistics ones. I guess that when the correlation is positive, I can conclude that the type of neighborhood (social context) has a positive impact on the protectiveness of friendship ties, and conversely. But how do I quantify that ?</p>\n\n<p>Moreover, I find it odd to get correlation of 1 or -1 and nothing more intermediate.</p>\n\n<p>As for the ICC I am puzzled because I have seen a post about lmer regression that indicates that intraclass correlation can be computed by dividing the variance of the random intercept by the variance of the random intercept, plus the variance the random variables, plus the residuals. </p>\n\n<p>But there are no residuals in the results of a glmer. I have read in a book that ICC must be computed by dividing the random intercept variance by the random intercept variance plus 2.36 (pi²/3). But in another book, 2.36 was replaced by the inter-group variance (the first level variance I guess). \nWhat is the good solution ?</p>\n\n<p>I hope these questions are not too confused.\nThank you for your attention !</p>\n
90579 | 90878 | 55734 | 1 | 0 | 0 | 0 | 115352 | 0 | 16.0 | 0 | <p>For example, I was looking at <a href="http://en.wikipedia.org/wiki/10-second_barrier" rel="nofollow">this list of the 93 people</a> who have broken the "10-second-barrier", after reading that sprinter Christophe Lemaitre was the first person of purely European decent to break the barrier, which got me to wondering what the difference between the mean sprinting times for whites vs blacks. Unfortunately, that number is probably not known since it would require making thousands of average people sprint, and even then it wouldn't necessarily reflect the "true genetic" difference, since the people sprinting were not training for sprinting. So if you wanted to measure the genetic component of the difference, it might be more accurate to measure only the fastest people in the world who are equally motivated and have been training for years, and therefore have eliminated the non-genetic disadvantages, at least that would be my theory.</p>\n\n<p>So if you could get a list of say the top 1000 fastest times in the 100 meter dash, and say 20 people on that list are white, could you use that data to give some estimation on what the full distributions look like and/or find what the mean of those distributions are? How?</p>\n\n<hr>\n\n<p>QUESTION is above^^ this is just some rambling:</p>\n\n<p>I would guess that if you were trying to find the difference in means between blacks and whites 100 meter sprinting times AFTER everyone in each population had trained for years, lost excess weight ect, i.e. you only want to measure the genetic difference in maximum potential, then measuring average people will not be the way to go, since none of them will have trained to reach their maximum potential, thus there will be many non-genetic factors causing differences, and the other problem is that the difference between trained and untrained, may not be the same, so if you want to measure difference in max potential, it would be better to look at the 1000 fastest, rather than 1000 average people. Also, the data for the 1000 fastest is very high quality, since it was done with laser timing and under official supervision, whereas data gathered from some fitness survey done at a few high schools would probably be of low quality.</p>\n
90580 | 90879 | 55738 | 11 | 0 | 0 | 0 | 115360 | 2 | 40.0 | 4 | <p>Is Student's t test a Wald test?</p>\n\n<p>I've read the description of Wald tests from Wasserman's <em>All of Statistics</em>.</p>\n\n<p>It seems to me that the Wald test includes t-tests. Is that correct? If not, what makes a t-test not a Wald test?</p>\n
90581 | 90880 | 55742 | 6 | 0 | 0 | 0 | 115366 | 1 | 17.0 | 0 | <p>Does any standard statistical software like R, SAS or SPSS have procedures or codes to analyze log-linear models for missing data in contingency tables using maximum likelihood estimation (or EM algorithm or other iterative procedures), not multiple imputation techniques ? </p>\n
90582 | 90881 | 55744 | 6 | 1 | 0 | 0 | 115370 | 1 | 13.0 | 2 | <p>im analyzing an article for my studies with the hypothesis if a change in work motivation ist related with a change in mental well being (<a href="http://www.sciencedirect.com/science...01879113001541" rel="nofollow">http://www.sciencedirect.com/science...01879113001541</a>). Sadly i dont know much about poisson regression. The follow up measurement was 18 month later. Do you consider always the time, when you do a poisson regression?Im not quite sure if they did so in this study.. If i imagine a graph of this regression, what can i see on the x and what on the y axis? Thanks for your help</p>\n
90583 | 90882 | 55746 | 106 | 1 | 0 | 0 | 115376 | 1 | 5.0 | 2 | <p>My goal is to create a formula that can give an indication of how a YouTube channel's video will perform in the first 30 days of its lifespan and eliminate viral video / "lightening in a bottle" outliers that may be on the channel. The goal is to use the resulting number to price a video from a specific YouTube channel. </p>\n\n<p>As an example, a hypothetical YouTube channel uploads approx 10 videos a month.\nVariables: </p>\n\n<ol>\n<li>Some videos get shared more and are more "viral"</li>\n<li>Videos have a "fat head" and "long tail." Fat head refers to the largest chunk of viewership which in the case of established YouTube channels happens upfront, and the long tail refers to accumulated views over succeeding months.</li>\n</ol>\n\n<p>These viewcounts belong to videos that the same channel uploaded in the last 30 days (from most recent in descending order): </p>\n\n<pre><code> 351,170 \n 770,783 \n1,183,166 \n 154,645 \n1,568,569 \n2,564,857 \n1,023,498 \n1,409,113 \n1,006,203 \n1,244,092 \n</code></pre>\n\n<p>So my questions: </p>\n\n<ol>\n<li>Is there a formula I could plug in to my spreadsheet given this data that could accurately come up with a conservative estimation of how the video will perform in the first 30 days? </li>\n<li>If not, how can I create one? </li>\n<li>Because some of these videos are still generating a "fat head" (like the most recently published video with 351,170 views) would it make sense to instead gather and average videos uploaded in the last 30-60 days instead? (fat head has time to impact viewcount and settle)</li>\n</ol>\n